This year, a report from Centers for Disease Control and Prevention revealed that America’s obesity rate has reached a record high. Contrasting to popular belief, New Yorkers are not so safe from the obesity epidemic, as more than half of adult New Yorkers are either overweight or obese. Studies show that the rise in the obesity epidemic is partly due to disparities in food environment; it is harder for some to eat healthier because their options are limited. This project intends to look deeper into the relationship between food environment in NYC and obesity rate along with diabetes rate and stroke hospitalization rate.
1. Can the percentage of certain fast-food restaurant be related to obesity, diabetes and stroke rate in different neighborhoods of NYC?
2. Can the percentage of health inspection grade A restaurant be related to obesity, diabetes and stroke rate in different neighborhoods of NYC?
3. Can the composite restaurant health score of each neighborhood be related to chronic disease rate?
For the first two questions, we stick to it. For the third question, we change the main predictor to percentage of fast food cuisine type since it was easy to acquire in the data. Composite restaurant health score were to subjective and difficult to assess. Moreover, we calculated the boro level model first due to the difficulties we encountered in scaping zipcode-neighborhood data. However, after some struggle, we still managed to get the neighborhood information and analyze the data accordingly.
1. NYC restaurant inspection:
https://data.cityofnewyork.us/Health/DOHMH-New-York-City-Restaurant-Inspection-Results/43nn-pn8j
2. 2015 Community Health Profiles Open Data:
https://www1.nyc.gov/site/doh/data/data-publications/profiles.page
These data were rather easy to acquire, since they can be downloaded by CSV format. Because health and demographic data are grouped by neighborhood but the restaurant inspection data was using zipcodes. We had to match the zipcode and neighborhood name in order to merge the datasets which was the hardest.
First, we scraped the table data from https://www.health.ny.gov/statistics/cancer/registry/appendix/neighborhoods.htm. However the neighborhood name and combinations were different from the health profile data we downloaded. We cannot merge the two datasets. Then we tried to look for the raw classification for the neigborhoods in health data from New York University’s Furman Center for Real Estate and Urban Policy and the NYC Department of City Planning. However, still no luck becuase the neighborhood was not divided by zipcode areas. Then we tried the hardest way: search the name of the neighborhood on Google and finding the matching zipcodes. It worked out well, but it was not precise, thus we left out some of the restaurants without a matching neighborhood name.
Downloading restaurant inspection data from NYC open data:
get_all_inspections = function(url) {
all_inspections = vector("list", length = 0)
loop_index = 1
chunk_size = 50000
DO_NEXT = TRUE
while (DO_NEXT) {
message("Getting data, page ", loop_index)
all_inspections[[loop_index]] =
GET(url,
query = list(`$order` = "zipcode",
`$limit` = chunk_size,
`$offset` = as.integer((loop_index - 1) * chunk_size)
)
) %>%
content("text") %>%
fromJSON() %>%
as_tibble()
DO_NEXT = dim(all_inspections[[loop_index]])[1] == chunk_size
loop_index = loop_index + 1
}
all_inspections
}
url = "https://data.cityofnewyork.us/resource/9w7m-hzhe.json"
rest_inspection = get_all_inspections(url) %>%
bind_rows()
Downloading health and demographic data CSV into local folder and reading, cleaning it:
download.file("https://www1.nyc.gov/assets/doh/downloads/excel/episrv/2015_CHP_PUD.xlsx", mode="wb", destfile = "health.xlsx")
health <- read_excel("health.xlsx", sheet = "CHP_all_data") %>%
select(Name, Racewhite_Rate, Age0to17_rate,
Age18to24_rate, Age25to44_rate, Poverty, Unemployment,
Smoking, Exercise,
Obesity, Diabetes, Stroke_Hosp) %>%
clean_names() %>%
mutate(age0to44_rate = age0to17_rate + age18to24_rate + age25to44_rate) %>%
select(-age0to17_rate, -age18to24_rate, -age25to44_rate)
Matching the neighborhood names with restaurant zipcodes:
zip_neighbor <- read_csv("neigh_zipcode.csv") %>%
mutate(zipcode = as.character(zipcode))
##restaurant data with neighbourhood
rest_neighborhood = left_join(rest_inspection, zip_neighbor, by = "zipcode") %>%
filter(!is.na(neighborhood))
health_only_neighborhood <- health[-c(1:6),] %>%
rename(neighborhood = name) %>%
mutate(neighborhood = as.factor(neighborhood))
##Plotting for outcome in different neighborhood
health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, obesity)) %>%
filter(obesity > 25) %>%
ggplot(aes(x = neighborhood,y = obesity, fill = neighborhood)) + geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Neighborhood with 25 percent more obesity rate")
health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, diabetes)) %>%
filter(diabetes > 10) %>%
ggplot(aes(x = neighborhood,y = diabetes, fill = neighborhood)) + geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Neighborhood with 10 percent more diabetes rate")
health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, stroke_hosp)) %>%
filter(stroke_hosp > 300) %>%
ggplot(aes(x = neighborhood,y = stroke_hosp, fill = neighborhood)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1),
legend.position = "none") +
labs(title = "Neighborhood with 300 more stroke hospitalization in 100,000 adults")
##Plotting for some adjusted variables
health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, poverty)) %>%
filter(poverty > 20) %>%
ggplot(aes(x = neighborhood,y = poverty, fill = neighborhood)) + geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
labs(title = "Neighborhood with 20 percent or more poverty")
health_only_neighborhood %>%
mutate(neighborhood = fct_reorder(neighborhood, age0to44_rate)) %>%
filter(age0to44_rate > 60) %>%
ggplot(aes(x = neighborhood,y = age0to44_rate, fill = neighborhood)) +
geom_col() +
theme(axis.text.x = element_text(angle = 45, hjust = 1), legend.position = "none") +
labs(title = "Neighborhood with 60 percent or more people under age 44")
Summary: Williamsbridge and Baychester have the highest obesity rate, up to 32%. East New York and Starrett City have the highest diabetes rate, up to 17 percent. BUshwick has the highest stroke hospitalization in 100,000 adults, up to 300 adults. Morrisania and Crotona has the highest percent of poverty rate, up to 40%. Fiancial district has the highest percent of young peoeple, up to 67%.
# Identify fastfood restaurants by their cuisine types. We print out the cuisine type list (n=85) and let everyone circle the ones they think is fastfood and we use the union as our rule (use union because it's more conservative).
rest_inspection %>%
mutate(dba = str_to_upper(dba)) %>%
arrange(cuisine_description) %>%
count(cuisine_description)
# calculating the total number of restaurants and the number of fastfood restaurants in the neighborhood, as well as the percentage of fastfood restaurants.
neighborhood_list =
rest_neighborhood %>%
distinct(neighborhood) %>%
arrange(neighborhood)
rest_fastfood_neighborhood =
rest_neighborhood %>%
filter(cuisine_description %in% c("Bagels/Pretzels",
"Bottled beverages, including water, sodas, juices, etc.",
"Chicken",
"Delicatessen",
"Donuts",
"Hamburgers",
"Hotdogs",
"Hotdogs/Pretzels",
"Ice Cream, Gelato, Yogurt, Ices",
"Nuts/Confectionary",
"Pancakes/Waffles",
"Pizza",
"Soul Food",
"Sandwiches",
"Sandwiches/Salads/Mixed Buffet",
"Soups & Sandwiches"))
percent_fastfood_neighborhood = function(name_neighborhood){
rest_each_neighborhood =
rest_neighborhood %>%
filter(neighborhood == name_neighborhood) %>%
distinct(camis)
n_rest_neighborhood = nrow(rest_each_neighborhood)
rest_fastfood_distinct_neighborhood =
rest_fastfood_neighborhood %>%
filter(neighborhood == name_neighborhood) %>%
distinct(camis, cuisine_description)
n_fastfood_neighborhood = nrow(rest_fastfood_distinct_neighborhood)
percent_fastfood_neighborhood = n_fastfood_neighborhood/n_rest_neighborhood
tibble(
neighborhood = name_neighborhood,
n_fastfood = n_fastfood_neighborhood,
n_rest = n_rest_neighborhood,
percent_fastfood = percent_fastfood_neighborhood
)
}
fastfood_neighborhood =
map(neighborhood_list$neighborhood, percent_fastfood_neighborhood) %>%
bind_rows() %>%
mutate(neighborhood = str_to_upper(neighborhood))
fastfood_neighborhood %>%
mutate(neighborhood = as.factor(neighborhood),
n_rest = as.numeric(n_rest),
neighborhood = fct_reorder(neighborhood, percent_fastfood)) %>%
plot_ly(., x = ~neighborhood, y = ~n_rest, type = 'bar', name = 'total restaurants') %>%
add_trace(y = ~n_fastfood, name = 'fastfood restaurants') %>%
layout(yaxis = list(title = 'number of restaurants'),
xaxis = list(title = 'neighborhood (ordered by percentage of fastfood restaurants)',
tickangle = 45),
barmode = 'stack')
From the plot, we can see that, while the Greenwich Village and SOHO neighborhood has fairly large number of restaurants, it has the smallest percentage of fastfood restaurants. Williamsbridge and Baychester has the largest percentage of fastfood restaurants.
When large number of total restaurants is not equal to large percentage of fastfood restaurants in that neighborhood, we can conclude that the distribution of fastfood restaurants is not even across neighborhoods, which also implies the motivation of our study, we want to investigate if this uneven distribution of fastfood restaurants is associated with diffferent level of chronic disease outcomes within a neighborhood.
## Matching list of chain restaurants in U.S. and restaurant inspection data
chains_html = read_html("https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual")
chain_rest = chains_html %>%
html_nodes("ul:nth-child(9) li , .jquery-tablesorter tr:nth-child(1) td:nth-child(1)") %>%
html_text() %>%
as.tibble() %>%
mutate(dba = value,
dba = str_to_upper(dba)) %>%
select(dba)
# read in the list of chain restaurants in us
# made the names to uppercase and changed the var name to dba
head(chain_rest, 10)
## # A tibble: 10 x 1
## dba
## <chr>
## 1 A&W RESTAURANTS
## 2 APPLEBEE'S
## 3 BAJA FRESH
## 4 BOSTON MARKET
## 5 BUFFALO WILD WINGS
## 6 CHILI'S
## 7 CHIPOTLE MEXICAN GRILL
## 8 CICI'S PIZZA
## 9 COLD STONE CREAMERY
## 10 CORNER BAKERY CAFE
I have scraped list of 75 national chain restaurants from the wikipedia page (https://en.wikipedia.org/wiki/List_of_restaurant_chains_in_the_United_States#Fast-casual) now I will join this dataset with restaurant inspection data to choose only the chain restaurants in NYC.
# removing punctuations in chain_rest & restaurant inspections
chain_rest_str =
chain_rest %>%
mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))
rest_inspect_str = rest_inspection %>%
mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))
# Matching the two datasets(restaurant inspection data that has all punctuation removed from dba(restaurant name) and list of chain restaurants by dba)
nyc_chain =
right_join(rest_inspect_str, chain_rest_str) %>%
filter(!is.na(camis)) %>%
distinct(camis, dba, boro)
## Joining, by = "dba"
nyc_chain %>%
group_by(dba) %>%
summarise(n = n()) %>%
arrange(desc(n)) %>%
head(10)
## # A tibble: 10 x 2
## dba n
## <chr> <int>
## 1 DUNKIN DONUTS 453
## 2 SUBWAY 344
## 3 STARBUCKS 282
## 4 MCDONALDS 207
## 5 CHIPOTLE MEXICAN GRILL 66
## 6 WENDYS 42
## 7 APPLEBEES 28
## 8 WHITE CASTLE 25
## 9 BOSTON MARKET 22
## 10 IHOP 21
The combined dataframe, nyc_chain has nrow(nyc_chain) observations. Also, there were 33 different chain restaurants extracted. The chart is showing the 10 chain restaurants in descending order. Next step is calculating the percentage of chain restaurants in each boro.
# counting chains in boro
count_chain = nyc_chain %>%
group_by(boro) %>%
summarise(chain_n = n())
count_rest = rest_inspection %>%
distinct(boro, camis) %>%
group_by(boro) %>%
summarise(res_n = n())
# calculating percentage of chains in each boro
percent_chain = left_join(count_chain, count_rest) %>%
mutate(chain_percentage = chain_n/res_n)
## Joining, by = "boro"
# let's see this in a graph
percent_chain %>%
mutate(boro = forcats::fct_reorder(boro, chain_percentage)) %>%
ggplot(aes(boro, chain_percentage, fill = boro)) + geom_bar(stat="identity")
We can see that Brooklyn has the lowest percentage of chain restauratns with percentage less that 5%. Next was Manhattan with 6% and the highest was Staten Island with 10% of chain restaurants.
data1 =
rest_neighborhood %>%
group_by(neighborhood, grade) %>%
summarise(n = n()) %>%
mutate(grade_percent = n / sum(n)) %>%
filter(grade == "A") %>%
ungroup(boro)
data1 %>%
mutate(neighborhood = as.factor(neighborhood),
neighborhood = fct_reorder(neighborhood,grade_percent)) %>%
plot_ly(x = ~neighborhood, y = ~grade_percent, color = ~neighborhood, type = "bar")
## Warning in RColorBrewer::brewer.pal(N, "Set2"): n too large, allowed maximum for palette Set2 is 8
## Returning the palette you asked for with that many colors
Different from the situation in borough, the difference of percentage of “grade-A” restaurants in each borough exists. “Throgs Neck and Co-op City” has the greatest grade-A restaurants percentage, around 44.50%. “Sunset Park”, however, has the least, around 30.62%. “Washington Heights”, where we live, takes the fourth from the end, 34.39%, which is obviously consistent with the feeling we have towards the restaurant condition of “Washington Heights”.
# combine the dataset with health data and run linear simple regression models
health_neighborhood =
health %>%
mutate(neighborhood = str_to_upper(name)) %>%
select(-name)
combined_neighborhood =
left_join(fastfood_neighborhood, health_neighborhood, by = "neighborhood")
# fit multiple linear regression model between the three health outcomes and percentage of fastfood restaurants after adjusting fro race, age, poverty, smoking status and exercise status
lm(obesity ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
broom::tidy() %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 25.7821960 | 13.6538571 | 1.8882720 | 0.0645740 |
| percent_fastfood | 48.7745006 | 13.8250910 | 3.5279696 | 0.0008850 |
| racewhite_rate | -0.1074704 | 0.0405247 | -2.6519723 | 0.0105840 |
| age0to44_rate | -0.0871726 | 0.1529002 | -0.5701272 | 0.5710466 |
| poverty | 0.1154709 | 0.1169470 | 0.9873783 | 0.3280294 |
| smoking | 0.7562470 | 0.2477709 | 3.0522026 | 0.0035739 |
| exercise | -0.2082229 | 0.1561606 | -1.3333889 | 0.1882161 |
lm(diabetes ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
broom::tidy() %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | 17.5350180 | 5.7068880 | 3.0726060 | 0.0033738 |
| percent_fastfood | 17.5639085 | 5.7784584 | 3.0395492 | 0.0037036 |
| racewhite_rate | -0.0778512 | 0.0169381 | -4.5962285 | 0.0000278 |
| age0to44_rate | 0.0056160 | 0.0639075 | 0.0878768 | 0.9303121 |
| poverty | 0.0443508 | 0.0488802 | 0.9073366 | 0.3684145 |
| smoking | 0.1527703 | 0.1035605 | 1.4751785 | 0.1461941 |
| exercise | -0.1465556 | 0.0652703 | -2.2453643 | 0.0290222 |
lm(stroke_hosp ~ percent_fastfood + racewhite_rate + age0to44_rate + poverty + smoking + exercise, combined_neighborhood) %>%
broom::tidy() %>%
kable()
| term | estimate | std.error | statistic | p.value |
|---|---|---|---|---|
| (Intercept) | -0.6792658 | 143.2162470 | -0.0047429 | 0.9962338 |
| percent_fastfood | 439.1079799 | 145.0123308 | 3.0280734 | 0.0038249 |
| racewhite_rate | -1.5938059 | 0.4250665 | -3.7495447 | 0.0004464 |
| age0to44_rate | 0.2610684 | 1.6037811 | 0.1627830 | 0.8713202 |
| poverty | 1.7935986 | 1.2266651 | 1.4621746 | 0.1497111 |
| smoking | 6.4527974 | 2.5988861 | 2.4829089 | 0.0162954 |
| exercise | 1.6711661 | 1.6379797 | 1.0202605 | 0.3123283 |
According to the results of multiple linear regressions, we can conclude that after adjusting for race, age, poverty, smoking status and exercise status, there is significant relationships between the three chronic disease outcomes and the percentage of fastfood restuarants within boroughs. (Obesity: p-value=0.000909491; Diabetes: p-value=3.756556e-03; Hospitalized Stroke: p-value=0.0039134849)
# removing punctuations in restaurant inspections
rest_neigh_str = rest_neighborhood %>%
mutate(dba = str_replace_all(dba, "[[:punct:]]", ""))
# Matching the two datasets(neighborhood restaurant inspection data that has all punctuation removed from dba(restaurant name) and list of chain restaurants by dba)
neighborhood_chain =
right_join(rest_neigh_str, chain_rest_str) %>%
filter(!is.na(camis)) %>%
distinct(camis, dba, neighborhood)
## Joining, by = "dba"
# counting chains in neighborhoods
neigh_count_chain = neighborhood_chain %>%
group_by(neighborhood) %>%
summarise(chain_n = n())
neigh_count_rest = rest_neighborhood %>%
distinct(neighborhood, camis) %>%
group_by(neighborhood) %>%
summarise(res_n = n())
# calculating percentage of chains in each boro
percent_neighborhood_chain = left_join(neigh_count_chain, neigh_count_rest) %>%
mutate(chain_percentage = chain_n/res_n)
## Joining, by = "neighborhood"
# joinining with health data
health_chain_neighborhood = left_join(percent_neighborhood_chain, health_only_neighborhood)
## Joining, by = "neighborhood"
## Warning: Column `neighborhood` joining character vector and factor,
## coercing into character vector
# function for modeling health outcome and percentage of chain restaurants
lm_neighborhood = function(health_outcome){
graphs_chain_health = health_chain_neighborhood %>%
mutate(boro = forcats::fct_reorder(neighborhood, health_outcome)) %>%
ggplot(aes(chain_percentage, health_outcome, color = neighborhood)) + geom_point() + theme(legend.position="none")
reg_chain = lm(health_outcome ~ chain_percentage + racewhite_rate + age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood) %>%
summary()
list(graphs_chain_health, reg_chain)
# function that takes a input of column in health_chain_neighborhood and gives output of:
# 1. graphs_health will give a graph showing the relationship between health outcomes in neighborhoods and their percentage of chain restaurants
# 2.reg_chain gives an output of the linear regression result
}
lm_neighborhood(health_chain_neighborhood$obesity)
## [[1]]
##
## [[2]]
##
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate +
## age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.7420 -2.9724 -0.8885 3.7028 8.7493
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 27.43030 14.43133 1.901 0.06288 .
## chain_percentage 62.89506 23.41510 2.686 0.00968 **
## racewhite_rate -0.13297 0.04047 -3.286 0.00182 **
## age0to44_rate -0.08772 0.16061 -0.546 0.58729
## poverty 0.17525 0.12581 1.393 0.16956
## smoking 0.82845 0.25612 3.235 0.00212 **
## exercise -0.19105 0.16346 -1.169 0.24783
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 4.925 on 52 degrees of freedom
## Multiple R-squared: 0.6608, Adjusted R-squared: 0.6217
## F-statistic: 16.88 on 6 and 52 DF, p-value: 1.062e-10
# I tried log transformation and it's even better!
#health_chain_neighborhood %>%
# mutate(boro = forcats::fct_reorder(neighborhood, obesity)) %>%
# ggplot(aes(chain_percentage, obesity^2, color = neighborhood)) + geom_point() + #theme(legend.position="none")
#
#lm(obesity^2 ~ chain_percentage + racewhite_rate + age0to44_rate + poverty + smoking + #exercise, data = health_chain_neighborhood) %>%
# summary()
Fitting a regression line with predictors chain_percentage, racewhite_rate, age0to44_rate, poverty,smoking, and exercise, we can see that the percentage of chains is a significant predictor with p-value, 0.01057. Holding all the other predictors constant, 62.2% increase in percentage of obesity in NYC neighborhoods is associated with 1% increase in chain restaurants. From the graph of percentage of chain restaurants and percentage of obesity, we can see there is a positive trend.
lm_neighborhood(health_chain_neighborhood$diabetes)
## [[1]]
##
## [[2]]
##
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate +
## age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -4.0861 -1.1391 -0.1949 1.3621 4.4662
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 18.193485 5.976656 3.044 0.00366 **
## chain_percentage 22.349494 9.697233 2.305 0.02521 *
## racewhite_rate -0.087199 0.016759 -5.203 3.37e-06 ***
## age0to44_rate 0.005019 0.066517 0.075 0.94015
## poverty 0.065475 0.052105 1.257 0.21452
## smoking 0.179443 0.106069 1.692 0.09668 .
## exercise -0.140563 0.067697 -2.076 0.04282 *
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 2.04 on 52 degrees of freedom
## Multiple R-squared: 0.7572, Adjusted R-squared: 0.7292
## F-statistic: 27.03 on 6 and 52 DF, p-value: 2.31e-14
We can also visually see there is positive correlation between diabetes and percentage of chains. From the linear regression analysis, the results show that percentage of chain restaurants in neighborhoods is a significant predictor of percentage of diabetics in NYC neighborhoods
lm_neighborhood(health_chain_neighborhood$stroke_hosp)
## [[1]]
##
## [[2]]
##
## Call:
## lm(formula = health_outcome ~ chain_percentage + racewhite_rate +
## age0to44_rate + poverty + smoking + exercise, data = health_chain_neighborhood)
##
## Residuals:
## Min 1Q Median 3Q Max
## -134.604 -30.567 2.136 25.483 141.942
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 81.6054 155.8476 0.524 0.60277
## chain_percentage 255.5637 252.8656 1.011 0.31685
## racewhite_rate -1.9933 0.4370 -4.561 3.13e-05 ***
## age0to44_rate -0.1588 1.7345 -0.092 0.92739
## poverty 1.9123 1.3587 1.407 0.16524
## smoking 7.8010 2.7659 2.820 0.00677 **
## exercise 1.6262 1.7653 0.921 0.36119
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 53.19 on 52 degrees of freedom
## Multiple R-squared: 0.6404, Adjusted R-squared: 0.5989
## F-statistic: 15.43 on 6 and 52 DF, p-value: 4.58e-10
When fitting chain_percentage, racewhite_rate, age0to44_rate, poverty,smoking, and exercise on rate of stroke however, the outcomes indicates that chain_percentage is not a significant predictor at significance level 0.05.
data2 =
data1 %>%
mutate(neighborhood = str_to_upper(neighborhood))
combined1 =
left_join(data2, health_neighborhood, by = "neighborhood")
First, combine the data which shows the health data of each neighborhood with the dataset which shows the percentage of “grade-A” restaurants in each neighborhood. Then we acquire the final dataset, we can do the linear regression between obesity, diabetes and stroke with the percentage of “grade-A” restaurants, which we treat as a represent of the healthy restaurants in that neighborhood.
reg4 <- lm(obesity ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + diabetes + stroke_hosp + age0to44_rate, data = combined1 )
summary(reg4)
##
## Call:
## lm(formula = obesity ~ grade_percent + racewhite_rate + poverty +
## unemployment + smoking + exercise + diabetes + stroke_hosp +
## age0to44_rate, data = combined1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -11.1700 -2.0304 -0.3021 2.0637 8.0918
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 9.51605 12.87776 0.739 0.46346
## grade_percent -17.55024 20.64471 -0.850 0.39940
## racewhite_rate 0.06956 0.04334 1.605 0.11494
## poverty -0.12597 0.11180 -1.127 0.26533
## unemployment 0.54499 0.30489 1.787 0.08004 .
## smoking 0.38736 0.20439 1.895 0.06398 .
## exercise -0.11180 0.12819 -0.872 0.38738
## diabetes 1.25042 0.24965 5.009 7.51e-06 ***
## stroke_hosp 0.02980 0.01091 2.730 0.00876 **
## age0to44_rate -0.06777 0.12009 -0.564 0.57514
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 3.616 on 49 degrees of freedom
## Multiple R-squared: 0.8277, Adjusted R-squared: 0.796
## F-statistic: 26.15 on 9 and 49 DF, p-value: 8.825e-16
broom::tidy(reg4)
## term estimate std.error statistic p.value
## 1 (Intercept) 9.51605092 12.87775730 0.7389525 4.634610e-01
## 2 grade_percent -17.55023512 20.64471083 -0.8501081 3.993999e-01
## 3 racewhite_rate 0.06956019 0.04334226 1.6049045 1.149415e-01
## 4 poverty -0.12597457 0.11180228 -1.1267621 2.653310e-01
## 5 unemployment 0.54498834 0.30489039 1.7874894 8.004392e-02
## 6 smoking 0.38735698 0.20439387 1.8951496 6.398006e-02
## 7 exercise -0.11180414 0.12819321 -0.8721534 3.873783e-01
## 8 diabetes 1.25041638 0.24965424 5.0085927 7.511381e-06
## 9 stroke_hosp 0.02980073 0.01091430 2.7304293 8.764716e-03
## 10 age0to44_rate -0.06776658 0.12009361 -0.5642813 5.751367e-01
p-value is smaller than 0.05, reject the null, meaning that there is a linear association between the obesity and the percentage of grade A restaurant in each neighborhood.
reg5 <- lm(diabetes ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + obesity + stroke_hosp + age0to44_rate, data = combined1 )
summary(reg5)
##
## Call:
## lm(formula = diabetes ~ grade_percent + racewhite_rate + poverty +
## unemployment + smoking + exercise + obesity + stroke_hosp +
## age0to44_rate, data = combined1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -3.4308 -1.0375 -0.0579 0.7548 4.3168
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) 7.389110 5.932974 1.245 0.218896
## grade_percent 15.613823 9.417344 1.658 0.103710
## racewhite_rate -0.071473 0.017999 -3.971 0.000234 ***
## poverty 0.057859 0.052046 1.112 0.271702
## unemployment -0.147801 0.144908 -1.020 0.312755
## smoking -0.029270 0.098453 -0.297 0.767497
## exercise -0.083995 0.058908 -1.426 0.160248
## obesity 0.270795 0.054066 5.009 7.51e-06 ***
## stroke_hosp -0.001295 0.005449 -0.238 0.813060
## age0to44_rate 0.011646 0.056044 0.208 0.836247
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 1.683 on 49 degrees of freedom
## Multiple R-squared: 0.8442, Adjusted R-squared: 0.8156
## F-statistic: 29.51 on 9 and 49 DF, p-value: < 2.2e-16
broom::tidy(reg5)
## term estimate std.error statistic p.value
## 1 (Intercept) 7.38910959 5.932973952 1.2454310 2.188962e-01
## 2 grade_percent 15.61382328 9.417343967 1.6579859 1.037096e-01
## 3 racewhite_rate -0.07147265 0.017998881 -3.9709496 2.341093e-04
## 4 poverty 0.05785864 0.052046265 1.1116770 2.717019e-01
## 5 unemployment -0.14780087 0.144907738 -1.0199654 3.127545e-01
## 6 smoking -0.02926982 0.098453156 -0.2972969 7.674966e-01
## 7 exercise -0.08399495 0.058908112 -1.4258638 1.602476e-01
## 8 obesity 0.27079499 0.054066083 5.0085927 7.511381e-06
## 9 stroke_hosp -0.00129547 0.005448697 -0.2377578 8.130603e-01
## 10 age0to44_rate 0.01164590 0.056043871 0.2077997 8.362466e-01
p-value smaller than 0.05, reject the null, meaning that there is a linear association between the diabetes and the percentage of grade A restaurant in each neighborhood.
reg6 <- lm(stroke_hosp ~ grade_percent + racewhite_rate + poverty + unemployment + smoking + exercise + diabetes + obesity + age0to44_rate, data = combined1 )
summary(reg6)
##
## Call:
## lm(formula = stroke_hosp ~ grade_percent + racewhite_rate + poverty +
## unemployment + smoking + exercise + diabetes + obesity +
## age0to44_rate, data = combined1)
##
## Residuals:
## Min 1Q Median 3Q Max
## -96.313 -25.002 2.209 22.423 129.730
##
## Coefficients:
## Estimate Std. Error t value Pr(>|t|)
## (Intercept) -10.7866 157.8985 -0.068 0.94581
## grade_percent -260.9177 250.8403 -1.040 0.30336
## racewhite_rate -0.7002 0.5329 -1.314 0.19503
## poverty -0.7829 1.3763 -0.569 0.57209
## unemployment 8.6891 3.6309 2.393 0.02058 *
## smoking 3.5201 2.5327 1.390 0.17086
## exercise 1.8603 1.5527 1.198 0.23665
## diabetes -0.8895 3.7412 -0.238 0.81306
## obesity 4.4313 1.6229 2.730 0.00876 **
## age0to44_rate 1.1746 1.4596 0.805 0.42485
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
##
## Residual standard error: 44.1 on 49 degrees of freedom
## Multiple R-squared: 0.767, Adjusted R-squared: 0.7242
## F-statistic: 17.92 on 9 and 49 DF, p-value: 1.106e-12
broom::tidy(reg6)
## term estimate std.error statistic p.value
## 1 (Intercept) -10.7866300 157.8984788 -0.0683137 0.945813925
## 2 grade_percent -260.9177119 250.8403054 -1.0401746 0.303364248
## 3 racewhite_rate -0.7001605 0.5329306 -1.3137932 0.195033842
## 4 poverty -0.7828686 1.3763477 -0.5688014 0.572088891
## 5 unemployment 8.6891360 3.6308590 2.3931351 0.020581132
## 6 smoking 3.5200670 2.5327013 1.3898469 0.170857798
## 7 exercise 1.8602912 1.5527164 1.1980882 0.236645761
## 8 diabetes -0.8894987 3.7411965 -0.2377578 0.813060320
## 9 obesity 4.4312954 1.6229299 2.7304293 0.008764716
## 10 age0to44_rate 1.1745939 1.4595775 0.8047493 0.424851956
Surprise! We get an insignificant variable! P-value is higher than 0.05, suggesting there is no linear association between stroke and the percentage of grade A restaurant in each neighborhood.
write
write